{
 "cells": [
  {
   "cell_type": "markdown",
   "metadata": {},
   "source": [
    "# 3.微调的应用场景 Where Finetuning Fits In  \n",
    "在这一节中，我们会了解微调真正融入训练流程的地方。微调实际上是在“预训练”这一步骤后！"
   ]
  },
  {
   "cell_type": "markdown",
   "metadata": {},
   "source": [
    "# 目录\n",
    "- [ 3.1 - 实验前了解一些知识](#3.1)\n",
    "- [ 3.2 - 实验](#3.2)\n",
    "  - [3.2.1 - 查看预训练数据集](#3.2.1)\n",
    "  - [3.2.2 - 我们将使用的公司微调数据对比](#3.2.2)\n",
    "  - [3.2.3 - 不同格式化数据的方式](#3.2.3)\n",
    "  - [3.2.4 - 储存数据的常见方式](#3.2.4)"
   ]
  },
  {
   "cell_type": "markdown",
   "metadata": {},
   "source": [
    "<a name='3.1'></a>\n",
    "# 3.1 实验前了解一些知识"
   ]
  },
  {
   "cell_type": "markdown",
   "metadata": {},
   "source": [
    "这是微调发生之前的第一步。它一开始采用的是完全随机的模型。它对世界一无所知。所以所有的权值，如果我们熟悉权值的话，都是完全随机的。它根本不能构成英语单词。它还没有语言技能。它的学习目标是下一个 `token` 预测。或者，我们知道，在一个简化的意义上，它只是下一个词预测。所以我们看到了单词 `once`，所以我们现在想让它预测单词 `upon`。然后我们看到 LLM 只是 `sd!!@`，离 `upon` 这个词很远。  \n",
    "  \n",
    "这就是开始的地方。但它从一个巨大的数据语料库中获取和读取数据，这些数据通常是从整个网络上抓取的。我们通常称它为未标记的，因为它不是我们一起构建的。我们只是从网上得来的。这其中经历了很多很多的数据清洗过程。因此，要使这个数据集有效地用于模型预训练，还有很多手工工作要做。这通常被称为自监督学习，因为模型本质上是通过下一个 `token` 预测来监督自己。它所要做的就是预测下一个单词。除此之外并没有真正的标签。  "
   ]
  },
  {
   "cell_type": "markdown",
   "metadata": {},
   "source": [
    "![1](../../figures/FLLM-3-1.png)"
   ]
  },
  {
   "cell_type": "markdown",
   "metadata": {},
   "source": [
    "现在，经过训练，我们可以看到这个模型现在能够预测单词 `upon`，或者标记 `upon`。这是习得的语言。它从网上学到了很多知识。"
   ]
  },
  {
   "cell_type": "markdown",
   "metadata": {},
   "source": [
    "![2](../../figures/FLLM-3-2.png)"
   ]
  },
  {
   "cell_type": "markdown",
   "metadata": {},
   "source": [
    "这背后的实际理解和知识往往不是很公开。人们并不确切知道大公司的闭源模型的数据集是什么样的。但是 `Eleuther AI` 做了一个了不起的开源项目，其创建了一个叫做 `The Pile` 的数据集（`Deeplearning.AI` 原来在后面的代码中用到了这个数据集，但由于这个数据集移位，所以他们换了个数据集），我们会在后面代码里探索它。它是一组从整个互联网上搜下来的22个不同的数据集。这里你可以看到这个图，你知道，`four score and seven years`。这就是林肯的葛底斯堡演说。还有林肯的 `carrot cake` 菜谱。当然，也有从 `PubMed` 中抓取的有关于不同医学文献的信息。最后，这里还有来自GitHub的代码。所以这是一组非常聪明的数据集，这些数据集被整理在一起，为这些模型注入了知识。现在，这个预训练步骤是非常昂贵和耗时的。这实际上是很昂贵的，因为要让这个模型浏览所有这些数据，从绝对随机到理解其中的一些文本，要知道，它要一边写代码，一边了解医学和葛底斯堡演说。"
   ]
  },
  {
   "cell_type": "markdown",
   "metadata": {},
   "source": [
    "![3](../../figures/FLLM-3-3.png)"
   ]
  },
  {
   "cell_type": "markdown",
   "metadata": {},
   "source": [
    "下图中，输入“墨西哥的首都是？”，输出却明显不对。"
   ]
  },
  {
   "cell_type": "markdown",
   "metadata": {},
   "source": [
    "![4](../../figures/FLLM-3-4.png)"
   ]
  },
  {
   "cell_type": "markdown",
   "metadata": {},
   "source": [
    "正如我们所看到的，从聊天机器人界面的意义上看，它并不是真正有用的。那么，如何将信息导入聊天机器人界面呢?嗯，微调是让我们到达那里的方法之一。它应该是我们工具箱里的一个工具。所以预训练是获得基本模型的第一步。当我们加入更多的数据，实际上不是很多的数据，我们可以使用微调来得到一个微调的模型。实际上，即使是一个微调过的模型，我们也可以在之后继续添加微调步骤。所以微调是之后的一个步骤。我们可以使用相同类型的数据。实际上我们可以从不同的来源收集数据并将其整合在一起，我们会看到一点。这就是未标记的数据。但是，我们也可以自己管理数据，使其更加结构化，以供模型学习。我们认为将微调与预训练区分开来的一个关键是所需的数据要少得多。我们在这个已经学习了很多知识和基本语言技能的基本模型的基础上建立，我们只是把它带到下一个层次。我们不需要那么多数据。所以这确实是我们工具箱里的一个工具。如果你来自其他机器学习领域，这可以是对判别任务的微调。也许我们正在处理图像，并且对ImageNet进行了微调。我们会发现这里微调的定义有点松散。对于生成任务来说，它的定义不太好因为我们实际上是在更新整个模型的权重，而不仅仅是其中的一部分，这通常是对其他类型的模型进行微调的情况。所以我们的微调训练目标和这里的预训练目标是一样的，生成下一个 `token`。我们所做的就是改变数据，使它在某种程度上更加结构化。模型在输出和模拟该结构时可以更加一致。而且，还有更高级的方法来减少你想要更新这个模型的次数。"
   ]
  },
  {
   "cell_type": "markdown",
   "metadata": {},
   "source": [
    "![5](../../figures/FLLM-3-5.png)"
   ]
  },
  {
   "cell_type": "markdown",
   "metadata": {},
   "source": [
    "那么微调对我们有什么好处呢？我们现在对它是什么有了一个感觉，但我们实际上可以用它做什么不同的任务呢？我喜欢考虑的一个大类是行为改变。我们在改变模型的行为。我们确切地告诉它，在这个聊天界面中，我们现在处于聊天设置中。这使得模型的反应更加一致。这意味着模型可以更好地聚焦。例如，这可能对适度 (`moderation`) 更好。它通常也只是在梳理自己的能力。所以在这里，它更擅长对话。所以它现在可以谈论各种各样的事情而不是以前我们必须做很多及时的工程来梳理这些信息。微调还可以帮助模型获得新的知识。所以这可能是围绕特定的主题不在基础预训练模型中。这可能意味着纠正以前不正确的信息。所以也许有更多最新的信息是我们想要模型实际注入的。当然，更常见的是，这两个模型都要用到。所以通常，我们在改变行为，我们想让它获得新的知识。"
   ]
  },
  {
   "cell_type": "markdown",
   "metadata": {},
   "source": [
    "![6](../../figures/FLLM-3-6.png)"
   ]
  },
  {
   "cell_type": "markdown",
   "metadata": {},
   "source": [
    "微调要完成的任务可以是抽取文本信息，也可以是扩展文本，如根据我们提供的基础信息写邮件。当我们更清楚我们想要完成的任务是什么的时候，微调的效果会更好。"
   ]
  },
  {
   "cell_type": "markdown",
   "metadata": {},
   "source": [
    "![7](../../figures/FLLM-3-7.png)"
   ]
  },
  {
   "cell_type": "markdown",
   "metadata": {},
   "source": [
    "如果这是你第一次微调模型，可以遵守下面的步骤：  \n",
    "1. 通过 prompt 工程首先识别可以完成的任务\n",
    "2. 寻找我们觉得大模型做的还不错的任务\n",
    "3. 挑选一个任务\n",
    "4. 获取对应任务的，可以是1000对输入输出（要比原先大模型表现要好）\n",
    "5. 用这些数据微调一个小模型"
   ]
  },
  {
   "cell_type": "markdown",
   "metadata": {},
   "source": [
    "![8](../../figures/FLLM-3-8.png)"
   ]
  },
  {
   "cell_type": "markdown",
   "metadata": {},
   "source": [
    "<a name='3.2'></a>\n",
    "## 3.2 实验  \n",
    "接下来我们来看看原有大模型和微调后的效果区别："
   ]
  },
  {
   "cell_type": "code",
   "execution_count": 1,
   "metadata": {},
   "outputs": [],
   "source": [
    "# 导入相应库\n",
    "import jsonlines\n",
    "import itertools\n",
    "import pandas as pd\n",
    "# pprint()函数作用是格式化打印输出对象，使输出更加规整美观\n",
    "from pprint import pprint\n",
    "\n",
    "import datasets\n",
    "from datasets import load_dataset"
   ]
  },
  {
   "cell_type": "markdown",
   "metadata": {},
   "source": [
    "<a name='3.2.1'></a>\n",
    "### 3.2.1 查看预训练数据集"
   ]
  },
  {
   "cell_type": "code",
   "execution_count": 4,
   "metadata": {},
   "outputs": [],
   "source": [
    "# 导入数据集  https://huggingface.co/datasets/allenai/c4/resolve/1ddc917116b730e1859edef32896ec5c16be51d0/en/c4-train.00000-of-01024.json.gz\n",
    "pretrained_dataset = load_dataset(\"./c4-train.00000-of-01024.json/\", \"en\", split=\"train\", streaming=True)"
   ]
  },
  {
   "cell_type": "code",
   "execution_count": 5,
   "metadata": {},
   "outputs": [
    {
     "name": "stdout",
     "output_type": "stream",
     "text": [
      "Pretrained dataset:\n",
      "{'text': 'Beginners BBQ Class Taking Place in Missoula!\\nDo you want to get better at making delicious BBQ? You will have the opportunity, put this on your calendar now. Thursday, September 22nd join World Class BBQ Champion, Tony Balay from Lonestar Smoke Rangers. He will be teaching a beginner level class for everyone who wants to get better with their culinary skills.\\nHe will teach you everything you need to know to compete in a KCBS BBQ competition, including techniques, recipes, timelines, meat selection and trimming, plus smoker and fire information.\\nThe cost to be in the class is $35 per person, and for spectators it is free. Included in the cost will be either a t-shirt or apron and you will be tasting samples of each meat that is prepared.', 'timestamp': datetime.datetime(2019, 4, 25, 12, 57, 54), 'url': 'https://klyq.com/beginners-bbq-class-taking-place-in-missoula/'}\n",
      "{'text': 'Discussion in \\'Mac OS X Lion (10.7)\\' started by axboi87, Jan 20, 2012.\\nI\\'ve got a 500gb internal drive and a 240gb SSD.\\nWhen trying to restore using disk utility i\\'m given the error \"Not enough space on disk ____ to restore\"\\nBut I shouldn\\'t have to do that!!!\\nAny ideas or workarounds before resorting to the above?\\nUse Carbon Copy Cloner to copy one drive to the other. I\\'ve done this several times going from larger HDD to smaller SSD and I wound up with a bootable SSD drive. One step you have to remember not to skip is to use Disk Utility to partition the SSD as GUID partition scheme HFS+ before doing the clone. If it came Apple Partition Scheme, even if you let CCC do the clone, the resulting drive won\\'t be bootable. CCC usually works in \"file mode\" and it can easily copy a larger drive (that\\'s mostly empty) onto a smaller drive. If you tell CCC to clone a drive you did NOT boot from, it can work in block copy mode where the destination drive must be the same size or larger than the drive you are cloning from (if I recall).\\nI\\'ve actually done this somehow on Disk Utility several times (booting from a different drive (or even the dvd) so not running disk utility from the drive your cloning) and had it work just fine from larger to smaller bootable clone. Definitely format the drive cloning to first, as bootable Apple etc..\\nThanks for pointing this out. My only experience using DU to go larger to smaller was when I was trying to make a Lion install stick and I was unable to restore InstallESD.dmg to a 4 GB USB stick but of course the reason that wouldn\\'t fit is there was slightly more than 4 GB of data.', 'timestamp': datetime.datetime(2019, 4, 21, 10, 7, 13), 'url': 'https://forums.macrumors.com/threads/restore-from-larger-disk-to-smaller-disk.1311329/'}\n",
      "{'text': 'Foil plaid lycra and spandex shortall with metallic slinky insets. Attached metallic elastic belt with O-ring. Headband included. Great hip hop or jazz dance costume. Made in the USA.', 'timestamp': datetime.datetime(2019, 4, 25, 10, 40, 23), 'url': 'https://awishcometrue.com/Catalogs/Clearance/Tweens/V1960-Find-A-Way'}\n",
      "{'text': \"How many backlinks per day for new site?\\nDiscussion in 'Black Hat SEO' started by Omoplata, Dec 3, 2010.\\n1) for a newly created site, what's the max # backlinks per day I should do to be safe?\\n2) how long do I have to let my site age before I can start making more blinks?\\nI did about 6000 forum profiles every 24 hours for 10 days for one of my sites which had a brand new domain.\\nThere is three backlinks for every of these forum profile so thats 18 000 backlinks every 24 hours and nothing happened in terms of being penalized or sandboxed. This is now maybe 3 months ago and the site is ranking on first page for a lot of my targeted keywords.\\nbuild more you can in starting but do manual submission and not spammy type means manual + relevant to the post.. then after 1 month you can make a big blast..\\nWow, dude, you built 18k backlinks a day on a brand new site? How quickly did you rank up? What kind of competition/searches did those keywords have?\", 'timestamp': datetime.datetime(2019, 4, 21, 12, 46, 19), 'url': 'https://www.blackhatworld.com/seo/how-many-backlinks-per-day-for-new-site.258615/'}\n",
      "{'text': 'The Denver Board of Education opened the 2017-18 school year with an update on projects that include new construction, upgrades, heat mitigation and quality learning environments.\\nWe are excited that Denver students will be the beneficiaries of a four year, $572 million General Obligation Bond. Since the passage of the bond, our construction team has worked to schedule the projects over the four-year term of the bond.\\nDenver voters on Tuesday approved bond and mill funding measures for students in Denver Public Schools, agreeing to invest $572 million in bond funding to build and improve schools and $56.6 million in operating dollars to support proven initiatives, such as early literacy.\\nDenver voters say yes to bond and mill levy funding support for DPS students and schools. Click to learn more about the details of the voter-approved bond measure.\\nDenver voters on Nov. 8 approved bond and mill funding measures for DPS students and schools. Learn more about what’s included in the mill levy measure.', 'timestamp': datetime.datetime(2019, 4, 20, 14, 33, 21), 'url': 'http://bond.dpsk12.org/category/news/'}\n"
     ]
    }
   ],
   "source": [
    "# 查看数据集（同时也是一个迭代器）的前n个元素\n",
    "n = 5\n",
    "print(\"Pretrained dataset:\")\n",
    "# 语法itertools.islice(iterable, start, stop, step) 并返回一个迭代器对象\n",
    "top_n = itertools.islice(pretrained_dataset, n)\n",
    "for i in top_n:\n",
    "  print(i)"
   ]
  },
  {
   "cell_type": "markdown",
   "metadata": {},
   "source": [
    "<a name='3.2.2'></a>\n",
    "### 3.2.2 和我们将使用的公司微调数据对比"
   ]
  },
  {
   "cell_type": "code",
   "execution_count": 9,
   "metadata": {},
   "outputs": [
    {
     "data": {
      "text/html": [
       "<div>\n",
       "<style scoped>\n",
       "    .dataframe tbody tr th:only-of-type {\n",
       "        vertical-align: middle;\n",
       "    }\n",
       "\n",
       "    .dataframe tbody tr th {\n",
       "        vertical-align: top;\n",
       "    }\n",
       "\n",
       "    .dataframe thead th {\n",
       "        text-align: right;\n",
       "    }\n",
       "</style>\n",
       "<table border=\"1\" class=\"dataframe\">\n",
       "  <thead>\n",
       "    <tr style=\"text-align: right;\">\n",
       "      <th></th>\n",
       "      <th>question</th>\n",
       "      <th>answer</th>\n",
       "    </tr>\n",
       "  </thead>\n",
       "  <tbody>\n",
       "    <tr>\n",
       "      <th>0</th>\n",
       "      <td>What are the different types of documents avai...</td>\n",
       "      <td>Lamini has documentation on Getting Started, A...</td>\n",
       "    </tr>\n",
       "    <tr>\n",
       "      <th>1</th>\n",
       "      <td>What is the recommended way to set up and conf...</td>\n",
       "      <td>Lamini can be downloaded as a python package a...</td>\n",
       "    </tr>\n",
       "    <tr>\n",
       "      <th>2</th>\n",
       "      <td>How can I find the specific documentation I ne...</td>\n",
       "      <td>You can ask this model about documentation, wh...</td>\n",
       "    </tr>\n",
       "    <tr>\n",
       "      <th>3</th>\n",
       "      <td>Does the documentation include explanations of...</td>\n",
       "      <td>Our documentation provides both real-world and...</td>\n",
       "    </tr>\n",
       "    <tr>\n",
       "      <th>4</th>\n",
       "      <td>Does the documentation provide information abo...</td>\n",
       "      <td>External dependencies and libraries are all av...</td>\n",
       "    </tr>\n",
       "    <tr>\n",
       "      <th>...</th>\n",
       "      <td>...</td>\n",
       "      <td>...</td>\n",
       "    </tr>\n",
       "    <tr>\n",
       "      <th>1395</th>\n",
       "      <td>What is Lamini and what is its collaboration w...</td>\n",
       "      <td>Lamini is a library that simplifies the proces...</td>\n",
       "    </tr>\n",
       "    <tr>\n",
       "      <th>1396</th>\n",
       "      <td>How does Lamini simplify the process of access...</td>\n",
       "      <td>Lamini simplifies data access in Databricks by...</td>\n",
       "    </tr>\n",
       "    <tr>\n",
       "      <th>1397</th>\n",
       "      <td>What are some of the key features provided by ...</td>\n",
       "      <td>Lamini automatically manages the infrastructur...</td>\n",
       "    </tr>\n",
       "    <tr>\n",
       "      <th>1398</th>\n",
       "      <td>How does Lamini ensure data privacy during the...</td>\n",
       "      <td>During the training process, Lamini ensures da...</td>\n",
       "    </tr>\n",
       "    <tr>\n",
       "      <th>1399</th>\n",
       "      <td>Can you provide an example use case where Lami...</td>\n",
       "      <td>An example use case where Lamini outperforms C...</td>\n",
       "    </tr>\n",
       "  </tbody>\n",
       "</table>\n",
       "<p>1400 rows × 2 columns</p>\n",
       "</div>"
      ],
      "text/plain": [
       "                                               question  \\\n",
       "0     What are the different types of documents avai...   \n",
       "1     What is the recommended way to set up and conf...   \n",
       "2     How can I find the specific documentation I ne...   \n",
       "3     Does the documentation include explanations of...   \n",
       "4     Does the documentation provide information abo...   \n",
       "...                                                 ...   \n",
       "1395  What is Lamini and what is its collaboration w...   \n",
       "1396  How does Lamini simplify the process of access...   \n",
       "1397  What are some of the key features provided by ...   \n",
       "1398  How does Lamini ensure data privacy during the...   \n",
       "1399  Can you provide an example use case where Lami...   \n",
       "\n",
       "                                                 answer  \n",
       "0     Lamini has documentation on Getting Started, A...  \n",
       "1     Lamini can be downloaded as a python package a...  \n",
       "2     You can ask this model about documentation, wh...  \n",
       "3     Our documentation provides both real-world and...  \n",
       "4     External dependencies and libraries are all av...  \n",
       "...                                                 ...  \n",
       "1395  Lamini is a library that simplifies the proces...  \n",
       "1396  Lamini simplifies data access in Databricks by...  \n",
       "1397  Lamini automatically manages the infrastructur...  \n",
       "1398  During the training process, Lamini ensures da...  \n",
       "1399  An example use case where Lamini outperforms C...  \n",
       "\n",
       "[1400 rows x 2 columns]"
      ]
     },
     "execution_count": 9,
     "metadata": {},
     "output_type": "execute_result"
    }
   ],
   "source": [
    "# .jsonl是一种常见的JOSN行格式化文件拓展名，即JOSN Lines。而且.jsonal是一种优化了的JSON格式，非常适合存储、读取大量结构化数据\n",
    "# 读取.jsonl文件，并查看\n",
    "filename = \"lamini_docs.jsonl\"\n",
    "instruction_dataset_df = pd.read_json(filename, lines=True)\n",
    "instruction_dataset_df"
   ]
  },
  {
   "cell_type": "markdown",
   "metadata": {},
   "source": [
    "<a name='3.2.3'></a>\n",
    "### 3.2.3 不同格式化数据的方式"
   ]
  },
  {
   "cell_type": "code",
   "execution_count": 29,
   "metadata": {},
   "outputs": [
    {
     "data": {
      "text/plain": [
       "\"What are the different types of documents available in the repository (e.g., installation guide, API documentation, developer's guide)?Lamini has documentation on Getting Started, Authentication, Question Answer Model, Python Library, Batching, Error Handling, Advanced topics, and class documentation on LLM Engine available at https://lamini-ai.github.io/.\""
      ]
     },
     "execution_count": 29,
     "metadata": {},
     "output_type": "execute_result"
    }
   ],
   "source": [
    "# 使上述数据两列的“问题”和“答案”之间形成字典，并查看连接后的第一个文本\n",
    "examples = instruction_dataset_df.to_dict()\n",
    "text = examples[\"question\"][0] + examples[\"answer\"][0]\n",
    "text"
   ]
  },
  {
   "cell_type": "code",
   "execution_count": 27,
   "metadata": {},
   "outputs": [],
   "source": [
    "# 将不同对应的“对子”连起来\n",
    "if \"question\" in examples and \"answer\" in examples:\n",
    "  text = examples[\"question\"][0] + examples[\"answer\"][0]\n",
    "elif \"instruction\" in examples and \"response\" in examples:\n",
    "  text = examples[\"instruction\"][0] + examples[\"response\"][0]\n",
    "elif \"input\" in examples and \"output\" in examples:\n",
    "  text = examples[\"input\"][0] + examples[\"output\"][0]\n",
    "else:\n",
    "  text = examples[\"text\"][0]"
   ]
  },
  {
   "cell_type": "code",
   "execution_count": 30,
   "metadata": {},
   "outputs": [],
   "source": [
    "# 构造一个 prompt 模板（包含“问题”和“答案”）\n",
    "prompt_template_qa = \"\"\"### Question:\n",
    "{question}\n",
    "\n",
    "### Answer:\n",
    "{answer}\"\"\""
   ]
  },
  {
   "cell_type": "code",
   "execution_count": 31,
   "metadata": {},
   "outputs": [
    {
     "data": {
      "text/plain": [
       "\"### Question:\\nWhat are the different types of documents available in the repository (e.g., installation guide, API documentation, developer's guide)?\\n\\n### Answer:\\nLamini has documentation on Getting Started, Authentication, Question Answer Model, Python Library, Batching, Error Handling, Advanced topics, and class documentation on LLM Engine available at https://lamini-ai.github.io/.\""
      ]
     },
     "execution_count": 31,
     "metadata": {},
     "output_type": "execute_result"
    }
   ],
   "source": [
    "# 将第一个文本的“问题”和“答案”填进 prompt 模板\n",
    "question = examples[\"question\"][0]\n",
    "answer = examples[\"answer\"][0]\n",
    "\n",
    "text_with_prompt_template = prompt_template_qa.format(question=question, answer=answer)\n",
    "text_with_prompt_template"
   ]
  },
  {
   "cell_type": "code",
   "execution_count": 32,
   "metadata": {},
   "outputs": [],
   "source": [
    "# 构造一个 prompt 模板（仅包含“问题”）\n",
    "prompt_template_q = \"\"\"### Question:\n",
    "{question}\n",
    "\n",
    "### Answer:\"\"\""
   ]
  },
  {
   "cell_type": "code",
   "execution_count": 15,
   "metadata": {},
   "outputs": [],
   "source": [
    "# 依据字典键值对长度，依次将“问题”和“答案”填充进两个模板中\n",
    "num_examples = len(examples[\"question\"])\n",
    "finetuning_dataset_text_only = []\n",
    "finetuning_dataset_question_answer = []\n",
    "for i in range(num_examples):\n",
    "  question = examples[\"question\"][i]\n",
    "  answer = examples[\"answer\"][i]\n",
    "\n",
    "  text_with_prompt_template_qa = prompt_template_qa.format(question=question, answer=answer)\n",
    "  finetuning_dataset_text_only.append({\"text\": text_with_prompt_template_qa})\n",
    "\n",
    "  text_with_prompt_template_q = prompt_template_q.format(question=question)\n",
    "  finetuning_dataset_question_answer.append({\"question\": text_with_prompt_template_q, \"answer\": answer})"
   ]
  },
  {
   "cell_type": "code",
   "execution_count": 33,
   "metadata": {},
   "outputs": [
    {
     "name": "stdout",
     "output_type": "stream",
     "text": [
      "{'text': '### Question:\\n'\n",
      "         'What are the different types of documents available in the '\n",
      "         \"repository (e.g., installation guide, API documentation, developer's \"\n",
      "         'guide)?\\n'\n",
      "         '\\n'\n",
      "         '### Answer:\\n'\n",
      "         'Lamini has documentation on Getting Started, Authentication, '\n",
      "         'Question Answer Model, Python Library, Batching, Error Handling, '\n",
      "         'Advanced topics, and class documentation on LLM Engine available at '\n",
      "         'https://lamini-ai.github.io/.'}\n"
     ]
    }
   ],
   "source": [
    "# 打印查看文本模板下列表的第一个元素\n",
    "pprint(finetuning_dataset_text_only[0])"
   ]
  },
  {
   "cell_type": "code",
   "execution_count": 17,
   "metadata": {},
   "outputs": [
    {
     "name": "stdout",
     "output_type": "stream",
     "text": [
      "{'answer': 'Lamini has documentation on Getting Started, Authentication, '\n",
      "           'Question Answer Model, Python Library, Batching, Error Handling, '\n",
      "           'Advanced topics, and class documentation on LLM Engine available '\n",
      "           'at https://lamini-ai.github.io/.',\n",
      " 'question': '### Question:\\n'\n",
      "             'What are the different types of documents available in the '\n",
      "             'repository (e.g., installation guide, API documentation, '\n",
      "             \"developer's guide)?\\n\"\n",
      "             '\\n'\n",
      "             '### Answer:'}\n"
     ]
    }
   ],
   "source": [
    "# 打印查看另一个模板下列表的第一个元素\n",
    "pprint(finetuning_dataset_question_answer[0])"
   ]
  },
  {
   "cell_type": "markdown",
   "metadata": {},
   "source": [
    "<a name='3.2.4'></a>\n",
    "### 3.2.4 储存数据的常见方式"
   ]
  },
  {
   "cell_type": "code",
   "execution_count": 35,
   "metadata": {},
   "outputs": [],
   "source": [
    "# 比如存储第二个模板下的内容为.jsonl格式，每一行都是json格式\n",
    "with jsonlines.open(f'lamini_docs_processed.jsonl', 'w') as writer:\n",
    "    writer.write_all(finetuning_dataset_question_answer)"
   ]
  },
  {
   "cell_type": "code",
   "execution_count": 36,
   "metadata": {},
   "outputs": [
    {
     "name": "stdout",
     "output_type": "stream",
     "text": [
      "DatasetDict({\n",
      "    train: Dataset({\n",
      "        features: ['question', 'answer', 'input_ids', 'attention_mask', 'labels'],\n",
      "        num_rows: 1260\n",
      "    })\n",
      "    test: Dataset({\n",
      "        features: ['question', 'answer', 'input_ids', 'attention_mask', 'labels'],\n",
      "        num_rows: 140\n",
      "    })\n",
      "})\n"
     ]
    }
   ],
   "source": [
    "# 如果我们之前将数据上传至 Hugging Face，那么我们可以下面这样拉取数据\n",
    "finetuning_dataset_name = \"lamini/lamini_docs\"\n",
    "finetuning_dataset = load_dataset(finetuning_dataset_name)\n",
    "print(finetuning_dataset)"
   ]
  },
  {
   "cell_type": "code",
   "execution_count": null,
   "metadata": {},
   "outputs": [],
   "source": []
  }
 ],
 "metadata": {
  "kernelspec": {
   "display_name": "base",
   "language": "python",
   "name": "python3"
  },
  "language_info": {
   "codemirror_mode": {
    "name": "ipython",
    "version": 3
   },
   "file_extension": ".py",
   "mimetype": "text/x-python",
   "name": "python",
   "nbconvert_exporter": "python",
   "pygments_lexer": "ipython3",
   "version": "3.10.9"
  },
  "orig_nbformat": 4
 },
 "nbformat": 4,
 "nbformat_minor": 2
}
